This research, carried out within the Personnel Accession and
Utilization Technical Area of the Army Research Institute (ARI),
includes a representative review of previous findings, both within the
Army and otherwise, on the validity and reliability of peer evaluations.
The research also reviews several situational or contextual factors that
should be considered in conducting peer evaluations.

This research is an in-house effort and is responsive to Army Project
2Q162717A766 and to special requirements of the Office of Deputy Chief
of Staff for Personnel.
BRIEF


Requirement:

To  review  previous  findings  on  the  validity  and  reliability  of  peer 
evaluations  as  well  as  various  situational  moderators. 


Procedure:

Peer evaluation research was reviewed from the four major perspectives
of evaluation process, methodology, situational factors, and validity
studies.


Findings:

Studies investigating the structure and nature of the peer evaluation
process have generally found fairly clear factor structure across
widely varying samples. There is some evidence that the structure may
be as much in the nature of the rater as the ratee. A review of findings
from research that utilized different methods indicated little evidence
for substantial differences, in either reliability or validity, among
techniques. Further, a review of the documented and potential effects
of situational factors impacting on the evaluation process indicated
that users of peer evaluation should be aware of these issues in
designing programs. Research generally has found substantial concurrent
and predictive validity, with correlations in the .30 to .50 range, but
with most studies limited to training groups.


Utilization of Findings:

Several issues surrounding peer evaluations remain unresolved; however,
evidence suggests that these issues can be resolved, and that peer
evaluations are a powerful tool in discriminating complex human behavior.
INTRODUCTION

When confronted with the prospect of drawing order out of complex
human behavior in the equally complex world of work, much traditional
behavioral science research has been marked by two primary
characteristics. First, heavy reliance has been placed upon human
evaluations of other human beings. Second, this evaluative information
has been typically gathered from a limited observational viewpoint,
that of a superior toward a subordinate. The technique presented in
this paper does not deviate from the first of these characteristics; it
does rely on human evaluation of other human beings. However, it goes
beyond the second characteristic by gathering evaluative information
from the perspective of an individual's peers. For purposes of this
paper, peers are operationally defined thus: (a) they have some common
purpose or frame of reference (e.g., members of the same work group),
and (b) generally speaking, they lack a formally recognized authority
relationship between them. Although the term "peer rating" is most
commonly applied to this technique, the present paper uses the more
generic term "evaluation," reserving the term "rating" for one
particular technique.

A source  of  much  confusion  in  peer  evaluation  research  has  been  a 
lack  of  clarity  between  the  technique  and  the  dimension  or  characteris- 
tic evaluated.  Although  previous  work  reviewed  here  substantially  sup- 
ports use  of  peer  evaluation  as  a technique,  issues  surrounding  the 
particular  dimensions  evaluated  are  not  discussed  in  this  review. 

This paper contains three relatively complementary sections. First,
a representative selection of typical validity research is reviewed,
along with a brief history of the use of peer evaluations. The second
section discusses various methodological issues underlying the peer
evaluation technique, and the third section presents several situational
or contextual factors that can affect a peer evaluation effort.


VALIDITY OF PEER EVALUATIONS

The history of the peer evaluation technique can be traced from the
seminal work of Moreno (1934) and the development of the sociogram
technique. However, the history of the technique as it is dealt with
here is more conveniently traced to several efforts conducted during
and after World War II (see, for example, Clarke, 1946; U.S. Army
Research Institute, 1943; Wherry, 1945). One of the earliest
investigations published in the professional literature is that by
Williams and Leavitt (1947).




Since that time, peer evaluations have been used for two primary
purposes. The first of these purposes is evaluative in the criterion
sense; the concern is in judging the extent or adequacy of some
individual characteristic (e.g., leadership effectiveness, job
performance). The second purpose is evaluative in the sense of gaining
information with which to predict some future outcome (individual
potential, motivation to work, etc.). Both purposes have guided the
efforts in research as well as operational settings, although typically
only one purpose has been the focus in any given situation.

Table 1 summarizes the results and major characteristics of a
representative sampling of studies which report validity information
for peer evaluations. This overview is intentionally not exhaustive,
since several other more specialized reviews are available elsewhere
(e.g., Gibb, 1969; Hollander, 1954a; Boulger & Coleman, 1964; Nadal,
1966). Lindzey and Byrne (1966) have also presented an excellent review
of the use of social choice methodology, of which peer evaluations are
one type.

There are several noteworthy features in Table 1. First, the magnitude
of the validity coefficients is generally strong in both concurrent
and predictive studies. Peer evaluations have shown rather strong
predictive ability even for periods up to 5 years (Hollander, 1965).
Furthermore, in those studies that included measures in addition to peer
evaluations, the peer evaluations tended to have the highest concurrent
or predictive validity.

Also, the majority of the evidence for the value of peer evaluations
has been gathered in a training situation, particularly in the military
environment. In fact, only two of the studies in Table 1 (Weitz, 1958;
Downey, Medland, & Yates, 1976) used a sample from other than a training
or educational environment. With a few exceptions, most evidence has
been gained from people relatively low in the hierarchy of their
organizational setting.

A third major feature of Table 1 is the variety of dimensions that
peers have been required to evaluate and the variety of criteria with
which peer evaluations have been related. The peer evaluation
dimensions have included leadership potential, personality traits, and
supervisory skill, to name but a few.


Table 1 (continued)

Table notes: significant group differences found; p < .05; p < .01.


Attempts to implement peer evaluation programs have produced an
impressive array of findings. However, several limitations also appear.
For instance, there is only minimal evidence of the validity of peer
evaluations among individuals at organizationally higher levels. There
is also a limited, but growing, amount of evidence of the utility of peer
evaluations in other than the training environment. In addition, in
studies that use peer evaluations as a predictor of a concurrent or
future criterion, virtually all the validity evidence is of a bivariate
variety. Although a number of studies demonstrated that peer
evaluations are often the best single predictor from among several
predictors, no research was found that attempted to determine what other
predictors might account for unique variance along with peer
evaluations. An exception to this preoccupation with the bivariate
paradigm is occasionally found in assessment center methodology.
Mackinnon (1975) has elsewhere presented a comprehensive review of
assessment centers, but even in assessment centers with a wealth of
information available, the differential validity of peer evaluations
has not always been adequately addressed.


METHODOLOGICAL  ISSUES 

Peer evaluations have been performed by means of four primary
techniques: ratings, rankings, full nominations, and high nominations.
The general paradigm of the rating technique calls for a group member to
provide a rating of the relative amount or degree of the dimension under
consideration possessed by every other group member. The ranking
procedure simply requires each group member to rank-order all other
group members from high to low (or some other relevant continuum) on the
dimension under consideration. The full nomination technique requires
that each group member choose a specified number or proportion of the
group as being either high, medium, or low on a given dimension. The
minor variation of this technique in which nominations of the middle are
not required is also referred to as full nominations. However, the case
in which only high nominations are elicited is reserved as a
discriminably different technique, for reasons to be elaborated upon in
later portions of the paper.
Several variations based on combinations of these basic techniques
are forced distribution rankings, or combinations of rankings with
ratings. General scoring algorithms for the four primary techniques
follow.
All these techniques produce scores with means independent of group
size, with the exception of the ranking formula, in which case adjustment
must be made for group sizes greater than 100. The standard deviation of
the various scores is a function of the reliability (consistency) of each
group's evaluations; Gordon (1969) and Willingham (1959) deal with
general issues related to reliability. Also, for a group using either a
ranking or nomination technique, the average score is determined; the
average score using the rating technique is free to vary.


Metric and Distribution

The metric and distributional properties of associate evaluations
are directly related to the particular technique employed. With respect
to scaling properties, the rankings and both nomination procedures
produce an ordinal scale (Stevens, 1951). The ratings from an evaluator
are the most nearly equal interval data, although here also it can be
argued that these are merely an ordinal scale. The scaling properties
of the summated scores from the various techniques approximate interval
data as the number in the evaluation group increases.
The four common procedures will generally produce different
distributions, examples of which are displayed in Figure 1. Given the
relatively free response mode, ratings will often produce negatively
skewed distributions largely because group norms tend to inflate any
evaluative procedure. The ranking procedure, if it were perfectly
reliable, would produce a rectangular distribution with one person at
each rank. Generally, less than perfectly reliable rank scores will tend
to be normally distributed, with very unreliable scores producing a more
leptokurtic curve, and a perfectly unreliable procedure producing a
point distribution with everyone receiving an average rank equal to the
middle rank. Full nomination scores produce a distribution which, if
perfectly reliable, is trimodal, with one group receiving all high
nominations, another group all low nominations, and the remainder middle
nominations or none at all. High nominations produce a bimodal
distribution (not shown in Figure 1).
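
The two limiting cases described for rankings can be illustrated with a
short sketch. It is not taken from the report: the function names are
hypothetical, the evaluators are assumed to rank every ratee (self-
exclusion is ignored for simplicity), and "perfect unreliability" is
idealized as a balanced set of orderings in which each ratee receives
every rank exactly once.

```python
from statistics import mean


def mean_ranks(rankings):
    """Average the rank each ratee receives across all evaluators.

    rankings[e][i] is the rank evaluator e assigns to ratee i (1 = highest).
    """
    n_ratees = len(rankings[0])
    return [mean(r[i] for r in rankings) for i in range(n_ratees)]


def agreeing_evaluators(n_ratees, n_evaluators):
    # Assumed ideal case: every evaluator produces the identical ordering.
    return [list(range(1, n_ratees + 1)) for _ in range(n_evaluators)]


def rotating_evaluators(n_ratees):
    # Assumed idealization of total disagreement: evaluator k shifts the
    # ordering by k places, so each ratee receives every rank exactly once.
    return [[(i + k) % n_ratees + 1 for i in range(n_ratees)]
            for k in range(n_ratees)]


perfect = mean_ranks(agreeing_evaluators(5, 4))
unrelated = mean_ranks(rotating_evaluators(5))
print(perfect)    # one ratee at each rank: rectangular distribution
print(unrelated)  # every ratee at the middle rank: point distribution
```

With five ratees, full agreement leaves one person at each of the ranks
1 through 5, whereas the balanced disagreement case places everyone at
the middle rank of 3.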


Basis  of  Comparison 

Scores resulting from the four primary techniques vary along another
important dimension: the evaluative process evoked in the evaluator upon
which judgments are made. Drucker (1957) initially pointed out the
duality of focus with which peer evaluations can be executed: whether
the frame of reference or standard upon which the evaluations are made is
internal or external to the group. In one case, the evaluator compares
the particular individual against a frame of reference external to the
group and assigns the individual to a category. In the second case, the
evaluator compares the particular individual against a frame of
reference internal to the group and makes a judgment of more or less, and
assigns the individual to the appropriate category. The external process
can be used only with the rating procedure. The internal process can
also be used with ratings; with rankings and nominations, it is required.
The internal process, in general, requires a moderate number of
individuals in the group (more than five). The direct implication of
this distinction is that the external frame of reference allows both
comparison between individuals across peer groups and the comparison of
peer groups. The internal process does not allow comparison between
individuals across peer groups unless the assumption is accepted that
the groups are equal on the particular ability, trait, or behavior.

A corollary of this implication is that population norms can be
developed only through the use of a rating procedure and an external
frame of reference, again unless group equality is assumed or assured.
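
The limitation of the internal frame of reference can be made concrete
with a small sketch. The numbers below are invented for illustration
only, and the helper name is hypothetical; the point is simply that two
groups differing uniformly in level yield identical within-group ranks,
so the ranks alone cannot support comparison across groups.

```python
def rank_within_group(scores):
    """Convert standing on some dimension to within-group ranks (1 = highest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Hypothetical standing of six members in each of two peer groups on an
# externally anchored scale; group B is uniformly 30 points above group A.
group_a = [40, 55, 50, 45, 60, 35]
group_b = [70, 85, 80, 75, 90, 65]

ranks_a = rank_within_group(group_a)
ranks_b = rank_within_group(group_b)
print(ranks_a)
print(ranks_b)
print(sum(group_b) / len(group_b) - sum(group_a) / len(group_a))
```

Both groups produce the same set of ranks even though their average
levels differ, which is why cross-group comparison under the internal
process rests on the assumption of group equality.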


Reliability 
The reliability of associate evaluations has generally been determined
by one of two methods, estimation of internal consistency or test-retest
correlation. Both methods are analogous to the same procedures in
classical test theory (Lord & Novick, 1968).
The internal consistency of peer evaluations is the degree to which
members of a peer group agree with one another when observing an
individual in a similar situation and at the same time. Using the
multiple-choice test paradigm, the evaluators are comparable to test
items and those who are being evaluated are comparable to persons taking
the test. Although Gordon (1969) has recommended the use of the alpha
coefficient for estimating the internal consistency or reliability of
peer evaluations, the most common procedure has been a split-half (or
group) estimate. The split-half estimate is made by randomly assigning
peer group members to one of two groups, computing scores in each group
for all group members, and then correlating the scores for each ratee
from each group (see Hollander, 1957, & Downey, 1974). The correlation
coefficient is then adjusted for total group size using the
Spearman-Brown formula (Gulliksen, 1950). If small groups are used, a
random split may not be possible, and some technique for averaging the
intercorrelations between evaluators could be used (Gulliksen, 1950).
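
The split-half procedure just described can be sketched as follows. This
is a minimal illustration under stated assumptions, not the procedure of
any cited study: scores are taken to be simple means of ratings received,
the data layout (a square matrix with None on the diagonal) and all
function names are hypothetical, and the Spearman-Brown step uses the
standard doubling form 2r / (1 + r).

```python
import random
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-group correlation up to the full group size."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(ratings, seed=0):
    """ratings[e][i] is the rating member e gives member i (None if e == i).

    Assumes at least four members, so each half can rate every member,
    and assumes scores vary within each half.
    """
    members = list(range(len(ratings)))
    random.Random(seed).shuffle(members)  # random assignment to halves
    half = len(members) // 2
    halves = (members[:half], members[half:])

    def half_scores(evaluators):
        scores = []
        for i in range(len(ratings)):
            received = [ratings[e][i] for e in evaluators if e != i]
            scores.append(sum(received) / len(received))
        return scores

    r_half = pearson(half_scores(halves[0]), half_scores(halves[1]))
    return spearman_brown(r_half)


# Hypothetical group of six in complete agreement about each member.
levels = [3, 5, 4, 2, 6, 1]
ratings = [[None if e == i else levels[i] for i in range(6)]
           for e in range(6)]
reliability = split_half_reliability(ratings)
print(reliability)
```

In this contrived case of complete agreement the two half-group scores
coincide, so the stepped-up estimate reaches its ceiling; real data would
give a lower value that depends on the particular random split.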

The test-retest method of estimating reliability requires that
group members evaluate each other at two different times. Scores from
the two different evaluations are then correlated. Examples of this
type of estimate are given in Hollander (1957) and Downey (1974, 1976).
Perhaps the most rigorous examination of reliability was done by Gordon
and Medland (1965), in which they varied both time of administration and
group doing the evaluations and found reliability coefficients in the
.80's.


Research has generally demonstrated the reliability of peer
evaluations to be in the .70 to .90 range, regardless of the type of
reliability estimate employed. Research comparing the various
evaluative methodologies is rare but has generally supported the view
that all four methods are quite similar, with perhaps a slight advantage
to ratings (Suci, Vallance, & Glickman, 1954; Downey, 1974; Hammer,
1963). Even the use of a paired comparison procedure does not
significantly improve reliability (Bolton, 1971).


Acceptability 

A major factor in the success or failure of any peer evaluation
procedure, whether for operational or research purposes, is the degree
to which participants accept the purpose of the evaluations.
Acceptability is generally studied as a specific issue of the particular
program under investigation rather than through comparative analyses of
acceptability across techniques or situations. There is therefore
little formal evidence of differences between techniques in this
respect, but inferences can be drawn from the particular qualities of
the technique.
A major factor in the acceptability of a technique is the degree of
perceived difficulty. From this point of view, both the rating and
ranking of large numbers of individuals (more than 20) can be
time-consuming and can make for difficult discriminations, particularly
among group members who are more or less average on the particular
dimension. On the other hand, the nomination procedure allows the
individual to place a large number of people in a desired category and
does not require such difficult discriminations.

The  rating  procedure  is  quite  acceptable  to  the  raters  where  the 
rated  group  is  small  and  cohesive.  The  full  nomination  technique  is  ac- 
ceptable to  the  nominators  for  moderate-size  to  large  groups  in  which 
not  all  individuals  are  well  known  to  one  another.  The  high  nomination 
technique  is  even  more  acceptable  because  it  does  not  require  an  individ- 
ual to  make  negative  evaluations. 

Another determinant of the degree of acceptability is the degree to
which group members are knowledgeable about the evaluation procedure,
process, background, and use. Downey (1975) found that acceptability
improved as a function of an educational program. Two different
considerations were noted: (a) the degree to which peer evaluations were
felt to be valuable and accurate estimates and (b) the degree to which
the evaluations were acceptable for particular uses. Downey also found
that a person's peer evaluation score and degree of acceptance of the
peer evaluation process were positively correlated; larger correlations
were found in the group who knew less about the peer evaluation process.


Feasibility 

Closely linked with the concept of acceptability is feasibility,
or costs associated with the implementation and execution of a
particular peer evaluation system. The major costs associated with a
peer evaluation system are (a) preparation of evaluation materials,
(b) administration time, and (c) scoring cost. Prior to the advent of
automatic data processing procedures, the costs associated with use of
any peer evaluation system in large groups or on a large scale were
prohibitive. Merely in terms of bits of information collected, it can be
seen that the number of evaluations is typically equal to n(n - 1),
where n is the number in the group. Thus, peer evaluation systems are
relatively costly efforts, which typically require more than minimal
sophistication with data processing procedures. Unfortunately, little
systematic information on cost is available.
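
As a small worked illustration of the n(n - 1) count, the sketch below
tabulates the number of individual evaluations for a few group sizes of
the order reported in studies cited in this review. The function name is
hypothetical, and the count assumes every member evaluates every other
member once.

```python
def evaluations_required(n):
    """Number of evaluations when each of n members evaluates the other n - 1."""
    return n * (n - 1)


for group_size in (10, 40, 321):
    print(group_size, evaluations_required(group_size))
```

The count grows roughly with the square of group size, which is
consistent with the observation that large-scale use was impractical
before automated scoring.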


SITUATIONAL  FACTORS 

In addition to the methodological concerns of the various techniques,
several situational or contextual factors can affect a peer evaluation
system, often without regard to the specific technique under discussion.
These factors include group size, informal group structures, demographic
characteristics, group boundaries, hierarchical characteristics,
friendships, length of association, and types of interaction.


Group  Size 

Very few attempts have been made to study the independent effects
of group size. More often than not, what evidence there is has been
reported as a byproduct in research directed elsewhere. For example,
Downey, Medland, and Yates (1976) used a peer nomination technique with
groups of Army colonels in 14 career groups that varied in size from 22
to 321. Reliability coefficients varied from .63 to .94, and the rank
order coefficient between group size and reliability was .03. Downey
(1976), in a sample of Army Rangers, compared peer ratings collected
within squads (n ≈ 10) with peer nominations collected on the same men
within platoons (n ≈ 40). Coefficients between the two scores were in
the .60's. However, platoon scores were both more reliable and more
predictive of job performance.

As mentioned previously, from the standpoint of feasibility both
ratings and rankings would seem to be most appropriate for relatively
small group sizes (approximately a dozen), whereas the nomination
technique is virtually mandatory for large groups (more than 50). From
the standpoint of empirical results, it appears that small groups may
produce somewhat unreliable scores, with reduced validity.
Alternatively, although it is rational to believe that there is an
optimal upper size peer group, scant evidence exists to support this
view.


Informal Group Structures

Within any formally defined group, there may exist one or more
informal subgroups defined by some sort of mutual self-interest. The
issue then arises as to the effect these informal subgroups may have on
a peer evaluation procedure conducted in the total group.

The worst case would be one in which two equal-sized informal
subgroups existed within a total group, and each group member was
exclusively in one subgroup or the other. In such a situation, one or
both subgroups might make their evaluations solely on the basis of
subgroup membership, i.e., on a basis other than the one intended. The
net effect of such behavior is to attenuate the validity of the peer
evaluation procedure; attenuation is most pronounced when both subgroups
engage in such behavior. The effect diminishes if one of the groups
does, in fact, provide evaluations over the whole group on the dimension
intended. The effect also diminishes as informal subgroup size decreases
or as the number of subgroups increases.


In terms of technique, the effect of subgroup behavior is pronounced
if ratings or rankings are used. Resultant scores are most likely to be
negatively skewed. The use of full nominations will tend to produce
scores with decreased variance, and high nominations will produce the
worst case with a drastic reduction in variance. An important point when
using nominations is that the use of too many nominations relative to
total group size may increase the effect of subgroup behavior (see
Downey, 1974).

It is clear that subgroups of sufficient size can have an effect
upon the final scores. The problem is the incidence of such effects and
whether there exists a mechanism for detecting them. If the evaluation
process is part of an ongoing process, the simplest procedure for
checking for these problems is the repetitive production of reliability
indices as part of the procedure for producing peer scores. If the
reliability coefficients were to drop below .60, it would probably
indicate a problem, and care should be taken in use of the evaluations.
Alternatively, a two-way analysis of variance design, one factor being
the type of raters and the other factor being the same type of ratees,
could be used. If a significant interaction were found, then a strong
case could be made for considering the peer scores as at least partially
the result of group membership.
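
The quantity that such a rater-by-ratee interaction test examines can be
sketched descriptively. The example below is only an illustration under
stated assumptions: it covers the two-subgroup case, the subgroup labels
"A" and "B", the function name, and the scores are all hypothetical, and
it computes the interaction contrast from cell means without the
significance test that a full analysis of variance would supply.

```python
from statistics import mean


def interaction_contrast(evaluations):
    """Average extra credit evaluators give members of their own subgroup.

    evaluations: iterable of (rater_subgroup, ratee_subgroup, score),
    with subgroups labeled "A" or "B". A value near zero suggests scores
    do not depend on shared subgroup membership.
    """
    cells = {}
    for rater, ratee, score in evaluations:
        cells.setdefault((rater, ratee), []).append(score)
    m = {cell: mean(scores) for cell, scores in cells.items()}
    own = m[("A", "A")] + m[("B", "B")]
    other = m[("A", "B")] + m[("B", "A")]
    return (own - other) / 2


# Hypothetical scores in which each subgroup favors its own members.
biased = [("A", "A", 8), ("A", "A", 9), ("A", "B", 4), ("A", "B", 5),
          ("B", "A", 3), ("B", "A", 4), ("B", "B", 8), ("B", "B", 7)]
# Hypothetical scores that depend only on who is being evaluated.
unbiased = [("A", "A", 6), ("A", "B", 7), ("B", "A", 6), ("B", "B", 7)]

print(interaction_contrast(biased))
print(interaction_contrast(unbiased))
```

A large contrast in data of the first kind would be the pattern that
motivates treating peer scores as partly a product of group membership;
whether it is statistically reliable would still need the formal test.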


Demographic  Characteristics 

The  use  of  peer  evaluations  with  their  reliance  upon  fallible  human 
observers  immediately  raises  the  possibility  of  racial  and  sexual  bias 
on  the  part  of  evaluators.  This  concern  is  especially  crucial  in  view 
of  recent  problems  associated  with  demonstrating  the  absence  of  bias  in 
employment  selection  and  classification  measures  as  well  as  in  criterion 
measures.

The evidence concerning racial bias in peer evaluations is mixed and
inconclusive. In a study dealing with Air Force recruits, Cox and
Krumboltz (1956) found that subjects were rated higher by members of
their own race, but the effect varied across groups, and there was
substantial agreement on rank order across races (r = .76). They
concluded that any bias was far from complete and suggested that prior
acquaintanceship of group members might account for the differences. In
a similar study in the Army, deJung and Kaplan (1962) found similar
results: ratings differed as a function of the rater's race. However, an
analysis of covariance adjusting for a combined interest and math score
showed that whites did not give higher adjusted scores to whites or
blacks, but that blacks gave higher adjusted scores to blacks. Results
were interpreted in terms of assignment of higher scores to close
acquaintances, a result that had most impact upon blacks rating blacks
(because of the smaller group size).
In a more recent study in an industrial training context, Schmidt
and Johnson (1971) used a forced-choice rating distribution in groups
made up of approximately equal numbers of blacks and whites. No
differences due to race were found.

The  evidence  suggests  that  peer  evaluations  can  be  subject  to  racial 
bias,  but  the  effect  is  perhaps  more  strongly  related  to  the  interaction 
between  friendship  or  acquaintanceship  and  the  particular  evaluation 
method  used  than  to  the  fact  of  race  itself.  The  presence  of  substan- 
tial correlation  between  the  rank  orderings  from  each  race  indicates 
that  the  ordering  was  not  much  affected  by  race.  But  the  use  of  ratings 
allows  evaluators  to  assign  unrelated  scores  to  individuals  whom  they 
consider  special  in  some  way. 

In terms of sexual bias, Mohr and Downey (1977) recently reported
results from a small sample of Army officers, in which females scored
lower than males on evaluations received from both males and females.
If bias occurred, it was on the part of both groups. An interesting
finding was that females' self-ratings were not related to either male
or female evaluations, but males' self-ratings were related to these
evaluations.

This admittedly small number of studies appears to indicate that
differences based upon race and sex can occur, but does not make clear
whether these differences are attributable to race or sex group
differences, to interaction patterns (e.g., friendships), to the
specific methodology, or to some combinations of these factors. It would
certainly be safe to say that researchers should be sensitive to the
potential for such bias.


Group  Boundaries 

The discussion of peer evaluations has proceeded to this point as
if it were clear just what is meant by a peer or associate group. Most
researchers report their procedures in sufficient detail to show the
general characteristics of the groups in the study. However, given the
variety of overlapping and higher order groups in most real-life
settings, the issue becomes that of defining some basic guidelines for
selecting the appropriate rating group. It is clear that the selection
of the evaluative group can be affected by such factors as length and
type of interaction, formal organizational structure, informal group
structure, friendship patterns, and, of course, the particular dimension
being evaluated.


There are few empirical findings to guide selection of the peer
group. Rather, guidelines must be best guesses based on partial
information from related data.
In a 1976 study, Downey found that platoon evaluations produced
more  reliable  and  slightly  more  valid  scores  than  did  squad  evaluations, 
but  the  differences  were  potentially  confounded  by  differences  in  method 
and  group  size.  Gordon  and  Medland's  1965  study,  in  which  individuals 
were  evaluated  at  two  different  times  by  totally  different  groups,  indi- 
cated a high  degree  of  stability  across  the  two  evaluations.  Even  the 
method  used  to  compute  reliability  indices,  random  splits  of  the  primary 
group,  supported  the  notion  that  group  composition  can  be  drastically 
altered  without  giving  rise  to  major  problems  in  the  reliability  and 
validity  of  scores. 
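
The random-split approach to reliability mentioned above can be
illustrated with a minimal sketch. It is not the procedure of any study
cited here: the function names, the Spearman-Brown step-up, and the
simplifying assumption that every rater scores every ratee are all
assumptions made for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(scores, n_splits=200, seed=0):
    """Average split-half reliability over random splits of the raters.

    scores[r][p] is the score rater r gave to ratee p (hypothetical
    layout; assumes every rater scores every ratee). Each split divides
    the raters into two halves, averages each half's scores per ratee,
    correlates the two sets of averages, and applies the Spearman-Brown
    correction (an assumption, not taken from the report).
    """
    rng = random.Random(seed)
    raters = list(range(len(scores)))
    n_ratees = len(scores[0])
    estimates = []
    for _ in range(n_splits):
        rng.shuffle(raters)
        half = len(raters) // 2
        group_a, group_b = raters[:half], raters[half:]
        mean_a = [mean(scores[r][p] for r in group_a) for p in range(n_ratees)]
        mean_b = [mean(scores[r][p] for r in group_b) for p in range(n_ratees)]
        r = pearson(mean_a, mean_b)
        estimates.append(2 * r / (1 + r))
    return mean(estimates)
```

If the two randomly formed halves order the ratees in much the same
way, the estimate approaches 1.0, which is the pattern the stability
findings above would lead one to expect.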


Hierarchical Characteristics


A concept related to that of group boundaries is that of hierarchies.
Suppose one were to perform a peer evaluation procedure in a traditionally
hierarchical organization. If work groups at the subordinate level are
chosen as the peer groups, what effect does inclusion of their immediate
superiors have on the resulting evaluations? Conventional wisdom tends
to hold that inclusion of such individuals can contaminate the procedure,
and therefore they should be excluded from the worker peer groups and
included in a peer group of first-level supervisors.

Again, results bearing upon hierarchical inclusion are mixed.
Research by Levi, Torrance, and Pletts (1958) indicated no effects from
including the formal leader in the peer evaluation process. Research
by Downey in 1975, in which the leaders of small combat units were
included in the peer nomination process, indicated that the leaders spanned
the full range of peer evaluation scores. There was a positive
relationship between formal position and peer evaluation scores of leadership
potential (as there should be, if the original selection procedure for
leaders had any validity). These data were experimental, and the
introduction of an operational system might change the result.

A rational solution to the boundary/hierarchical problem should be
guided by the following suggestions:

1. The group selected should be large enough to overcome problems
associated with primary groups.

2. The group should not be so large as to include subgroups who
may be relatively unknown to each other or may be competing for
similar resources and rewards.

3. The function of the group selected should be reasonably related
to the dimension to be evaluated; e.g., if evaluation of leadership
in a work setting is desired, a work group and not a social
group should be selected.


Friendship

Friendship has been a major research issue in the history of peer
evaluations. According to folklore, peer evaluations are the product
of friendship or popularity and are therefore not valid indications of
the dimension under consideration. The impact of this bit of folklore
has been that, with the exception of simple validity studies, this is
probably the single most researched question associated with peer
evaluations.

Wherry and Fryer (1949) were the first to address the issue of
friendship in peer ratings. They reported that although there was a
moderate degree of relationship between friendship and a leadership
criterion, the major portion of the predicted criterion variance was
independent of friendship. They concluded that peer evaluations of
leadership are not popularity contests. Studies by Gibb (1950) and Horrocks
and Wear (1953) in college samples supported Wherry and Fryer's findings.
Borgatta (1954) also reported that leadership and popularity evaluations
were related, but he failed to draw any conclusions. Several other
investigations have documented a moderate degree of relationship between
friendship and peer evaluations of leadership (Hollander, 1956; Hollander
& Webb, 1955; Theodorson, 1957).
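
One way to express the idea that predicted criterion variance can be
largely independent of friendship is a first-order partial correlation.
The sketch below is only an illustration of that general statistic; it
is not claimed to be the analysis used in any of the studies cited, and
the input values are hypothetical.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant.

    Here x could stand for peer evaluation scores, y for the criterion,
    and z for friendship (labels are illustrative assumptions).
    """
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical correlations, chosen for illustration only.
r_peer_criterion = 0.45
r_peer_friendship = 0.30
r_friendship_criterion = 0.20

adjusted = partial_correlation(
    r_peer_criterion, r_peer_friendship, r_friendship_criterion
)
```

When the adjusted value stays close to the unadjusted peer-criterion
correlation, friendship accounts for little of the predictive
relationship.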

Downey (1974) presented evidence that the use of full nominations
(with small numbers of high and low nominations required) reduced the
correlation between friendship and leadership evaluations compared with
forced distribution ratings.

It seems that when an evaluator is faced with the task of evaluating
several people, some of whom he or she considers friends, the
evaluator will tend to select a friend rather than another person considered
to be of equal, or at least indistinguishable, merit. Therefore, the
variance associated with friendship may be a source of systematic error
primarily in the middle of the distribution. This systematic error
variance will increase in large groups, in which some members are relatively
unknown to each other or the interaction patterns are not fully
established for all members.

However, in spite of the impressive array of research findings as
to the minimal effect of friendship, the "popularity contest" issue
remains the argument most consistently offered against the use of peer
evaluations in an operational setting.


Length of Interaction

When peer evaluations are considered for use in any situation, an
important question is how long group members must be associated with
each other before they can provide reliable and valid evaluations. This
issue is often raised in the context of transient training groups.
Research fairly consistently finds that peers can make reliable and
valid evaluations after a relatively short period of time, typically
3 to 6 weeks (Hollander, 1957).

Subsidiary  to  the  overall  issue  is  the  effect  of  including  a new 
group  member  in  an  intact  group.  Mayfield  (1975)  has  suggested  that  in 
such  a situation  there  may  be  reason  to  suspect  that  a longer  period  of 
acquaintanceship  is  necessary  for  sufficient  integration  into  the  group. 

A more generalized way of approaching the question is to determine which
person is known or not well known to other members of the group.
Evidence has shown that an individual not well known to other members of the
group will typically be evaluated as near the middle of the distribution
of peer evaluation scores within the group (Downey, 1974).

In terms of technique, a nomination procedure is most likely to
decrease the error variance associated with acquaintanceship; ratings or
rankings tend to capitalize on the error variance and show a greater
degree of relationship with acquaintanceship.


Type of Interaction


Although peer evaluations have been used and reported over a span
of more than 25 years, they have been applied in rather limited
situations. Most of the research has been conducted with junior personnel in
a military training context such as Officer Candidate School (OCS). A
recent effort to use a peer nomination process in a senior Army officer
promotion system produced supportive results (Downey, Medland, & Yates,
1976). Outside the military, Weitz (1956) and subsequently Mayfield
(1970, 1975) have worked in industry with insurance salesmen.

Freeberg (1969) reported a project in which peer evaluations were
more highly related to a performance criterion when the interaction
between peers was relevant to the dimension being evaluated. Bayroff and
Machlin (1950) found that leadership evaluations could be made in an
academic environment and were highly related to evaluations made after
exposure to a situation where leadership was displayed. Lewin, Dubno,
and Akula (1971) indicated that video tapes supplied sufficient
information for reliable evaluations and that these evaluations were highly
related to evaluations from group members.

Until more extensive research is conducted in broader organizational
contexts with a wider selection of subject populations, the
generality of the peer evaluation process is largely a matter of
conjecture. However, it would be safe to assume that peer evaluations of a
variety of complex human behaviors can be rendered reliably after
exposure of the peers to each other in situations that require the
individual to interact either with the environment or with others in
relevant situations. Further, the validity of the evaluations will be
a function of the degree to which the particular behaviors are relevant
to the dimension under study. Hollander (1956) found that reliable
evaluations were given after 1 hour of discussion between peers in a
naval OCS class, but the scores had only moderate relationship with
evaluations obtained 3 weeks later, and were even less predictive of
eventual job performance. This convergence of views by peers after a
short period of exposure is probably a function of similar psychological
maps of behavior on the part of peers, and the preliminary evaluations
are subject to revision based upon further information. There seems
to be little advantage in using one evaluative technique over another,
so long as the technique does not require the evaluator to make finer
discriminations than are possible, based on the type of interaction
and the amount of information that can be gathered from the interaction.


SUMMARY 

Researchers have used the peer evaluation technique both as a
criterion of complex human behavior and as an index of future potential.
The particular dimension measured has varied considerably. The validity
research summarized presents an impressive array of findings with
correlation coefficients in the .30 to .50 range either in a concurrent or
a predictive situation. Research on extending the generality of the peer
evaluation procedure to a more diverse sampling of peer group types,
particularly nontraining groups, has been limited.

The four major techniques have also demonstrated important
similarities and differences in their psychometric properties. For example,
only ratings can produce comparable scores across different groups
without extensive assumptions. Research results indicate little difference
in measurement reliability between techniques. The limited findings also
indicate that, in general, ratings and rankings are less acceptable than
either of the nomination techniques.

In view of the documented and likely effects of various situational
factors on the evaluation process, it is important that the researcher
be aware of potential problems in the use of peer evaluations. No direct
relationship was found between group size and the reliability or validity
of the evaluations, but it can be assumed that very small or very large
groups will produce less reliable and less valid scores. Group
structure and demographic characteristics were found to be sources of
potential difficulties. With respect to the popular issues of friendship,
acquaintanceship, and type of personal interaction, there is little
evidence that these have a major impact on the validity of the scores.
Indications are that all techniques are relatively impervious to a
variety of situational factors, the nomination technique being perhaps the
most versatile.

One possible adjustment in future work with this technique is to
begin referring to it as associate evaluation rather than peer
evaluation. The term peer evaluation, or more commonly peer rating, has
acquired overtones of meaning and often has a negative connotation among
those required to perform the evaluations. Moreover, the more
generalized rubric "associate evaluation" conceptually embraces more individuals;
the distinction should not be merely semantic.

In brief, peer evaluations, or associate evaluations, have been
shown to be fruitful tools in both research and application. Several
issues regarding their use remain to be resolved, but there is
sufficient evidence to suggest that these issues can be resolved, and that
they do not detract from the conclusion that associate evaluations are
a very powerful tool for discriminating complex human behavior.